Abstract

We consider the evaluation of approximate top-k queries from relations with 
a-priori unknown values. Such relations can arise for example in the context of ex- 
pensive predicates, or cloud-based data sources. The task is to find an approximate 
top-k set that is close to the exact one while keeping the total processing cost low. 
The cost of a query is the sum of the costs of the entries that are read from the 
hidden relation. 

A novel aspect of this work is that we consider prior information about the values 
in the hidden matrix. We propose an algorithm that uses regression models at query 
time to assess whether a row of the matrix can enter the top-k set given that only 
a subset of its values are known. The regression models are trained with existing 
data that follows the same distribution as the relation subjected to the query. 

To evaluate the algorithm and to compare it with a method proposed previously
in the literature, we conduct experiments using data from a context-sensitive Wikipedia
search engine. The results indicate that the proposed method outperforms the
baseline algorithms in terms of cost while maintaining a high accuracy of the
returned results.

1 Introduction 

Databases are traditionally concerned with the task of efficiently retrieving a set of tuples
that match a given query. However, we can also rank the result set according to some
scoring function. Often only the top-scoring tuples of this ranking are of interest. A
typical example is an information retrieval system: the user of a search engine
is unlikely to be interested in the entire ranking of results, but only in the first few items.
The idea of top-k processing (see e.g. [331 E21 EHJ ED HI ED]) is to retrieve the first k
results of this ranking without computing the score for every matching document. Most
existing work on top-k search focuses on the task of retrieving the tuples with the highest
score according to a given scoring function $f$ from a known relation. In general $f$ is
assumed to be monotonic p3], and usually it is a convex combination of the attribute
values. Moreover, since the relation is known, various indexing techniques can be applied
to speed up the processing. For a more in-depth discussion, see the excellent survey [T^]
on the topic.

In this paper we consider the top-k problem in a somewhat different scenario that can
be explained as follows. Suppose we have two people, A and B. The person A has a 
vector of m elements, while the person B has a matrix of n rows and m columns. The 
task of person A is to find those k rows of this matrix that have the highest inner product 
with her vector. However, B will not reveal the matrix to A. Instead, A has to ask for the 
values of individual cells one by one. Moreover, to each cell is associated a cost that A has 
to pay before B reveals the value. These costs are known to A. How should A proceed 
in order to find the k highest scoring rows while keeping the total cost of the process 
low? We also give A some additional information in the form of examples of matrices 
that B has had in the past. Using these as training data A can employ machine learning 
techniques to find out what elements of the matrix to ask for. 

In other words, we must apply a known linear ranking function (the vector of person 
A) to a hidden relation (the matrix of person B) given knowledge about the distribution 
of the values in the hidden relation. This is in contrast to most existing work, where the
relation is assumed to be known, and the ranking function may vary. We also assume that 
the access costs are high: reading the value of an entry in the matrix is computationally 
expensive. In this paper we propose an algorithm that will find an approximate answer 
to a top-k query while keeping the cost of the query low. 

Since the contents of the hidden relation are unknown at the time the query is issued,
a solution cannot rely on pre-built index structures. We do assume, however, that all
relations that we will encounter follow the same distribution, and that we can sample
training data from this distribution. The algorithm that we propose makes use of regression
models to estimate whether or not a row can belong to the top-k set after having
observed only a subset of its entries. Moreover, we can decrease the cost of the query
by allowing a small number of errors in the results. That is, we allow the algorithm to
return a set of documents that is not the exact top-k set. The results may miss some
high-scoring documents and, correspondingly, contain other documents that do not belong to the
exact top-k set.

In practice this top-k problem can be motivated by the following design of a context-sensitive
search engine [27], where a query is assumed to consist of a set of terms and a
context document. The context document can be e.g. the page currently viewed by the
user. To process the query we first retrieve all documents that contain the query terms,
and then describe each of these by a feature vector that is a function of the context. The
final score of a document is given by the inner product of the feature vector with a scoring
vector. A context-dependent feature could be e.g. a measure of textual similarity between
the context document and the document that is being ranked. These context-dependent
features could be implemented as expensive predicates. We can thus think that person B
is hiding the feature vectors, and by computing a feature we are asking for its value. The
total cost we have to pay reflects the computational overhead associated with finding out
the values of the features.

1.1 Related Work



We discuss related work from a number of different angles: databases, approximate nearest 
neighbor search, and learning theory. 

1.1.1 Databases 

A considerable amount of literature has been published about top-k processing in
recent years. For a thorough review we refer the reader to the survey by Ilyas et al. [T5].
The basic setting in this literature is usually slightly different from the one taken in this paper. A
common assumption is that the data is fixed, and various preprocessing methods can be
applied. The best known work in this line of research is that of Fagin and
others on the threshold algorithm [HI [HJ [23] and its variants, see e.g. [26j 12 ED].
The idea is to sort each column of the relation in decreasing order of its values as a
preprocessing step. As a consequence, tuples that have a high total score should appear
sooner in the sorted lists. We cannot make use of this approach, as it requires either reading
all values of the input relation in order to do the sorting, or data sources
that directly provide the columns in sorted order, neither of which we have at our
disposal.

Considerably more closely related to this paper is the work of Marian et al. [21], where the
problem of aggregating several web-based data sources is considered. They too assume
that probing values from the relation(s) is time-consuming and that the algorithm
should therefore aim to minimize the total execution time of the query. Another important reference
for the current work is the MPro algorithm discussed by Hwang and Chang [H], to which
the algorithm in [21] is closely related. The crucial difference to our work is that neither
[21] nor [H] makes similar use of training data, and both require an exact top-k list as the
result.

In addition to top-k processing, we also briefly mention work on classical database
query optimization. Of particular interest to us is research on optimizing queries with
expensive predicates [121 HE]. Of course [H] falls into this category as well, as it considers
top-k processing under expensive predicates. Here the fundamental question concerns
finding a query plan that minimizes the total execution time given that some (restriction)
predicates used in the query are computationally expensive. These expensive predicates
can be e.g. arbitrary user-defined functions. Our work can be seen in this framework
as well. We consider the processing of a query (a type of SELECT) that must return
all rows of the hidden relation that belong to an approximate top-k set defined by the
given scoring function. We can assume that reading one entry of the hidden relation on a
particular row corresponds to evaluating one expensive predicate for this row. While the
value of an entry does not directly specify whether or not the row belongs to the top-k
set, the algorithm that we propose later in Section 4 gives a probabilistic estimate of this
based on the known entries of a row.

1.1.2 Approximate nearest neighbors 

Since our ranking is based on the inner product between the rows and a scoring vector, the
top-k set is equivalent to the set of k nearest neighbors of the scoring vector if everything
is normalized to unit length. Algorithms for k-NN queries (in high-dimensional Euclidean
spaces) have been widely studied. In particular, papers related to approximate nearest
neighbor search [TB] are of interest in the context of our work. The usual approach in
these is to reduce the number of required distance computations by preprocessing the set
of points that is being queried. In locality sensitive hashing [8] the underlying relation
is indexed using a number of hash functions so that collisions indicate close proximity.
A query vector is only compared to vectors mapped to the same bucket by the hash
function. Kleinberg takes a similar approach but uses random projections instead
of hash functions. A somewhat different preprocessing technique is low-dimensional
embedding pQ, which aims to speed up the processing by representing the set of points in a
lower-dimensional space where distance computations can be carried out faster. Singitham
et al. [25] propose a solution based on clustering of the database, where the query vector
is only compared to points that reside in clusters whose centroid is close to the query
vector. More recently, Goel et al. [H] proposed another technique based on clustering that uses
the query distribution together with a variant of the threshold algorithm [5].

However, the basic assumption in [HI [191 E21 HI E] and related literature is that the data
being queried is known a-priori, so that indexing techniques can be applied to quickly find
points that are close to an arbitrary query vector. To see the problem that we discuss in
this paper as k-NN search, we have to turn the setting upside down, so that the query
vector (i.e. our scoring weights) is fixed, and the set of points (i.e. the rows of our hidden
relation) is sampled from a known distribution. Moreover, an essential property of
our problem is the cost associated with reading values from the hidden relation. Such
assumptions are, to the best of our knowledge, not made in any of the existing work on
k-NN search.

1.1.3 Learning theory 

Unlike methods for approximate nearest neighbor search, some models in computational
learning theory take costs for accessing input items into account [21 [T71 [21 [TO]. In general
this line of work considers ways to evaluate a (boolean) function when the inputs are 
obtained only by paying a price. An algorithm is given a representation of the function 
(e.g. a boolean circuit), and the costs associated with each input. The algorithm must 
learn the value of the function while keeping the cost of the process low. In the simplest 
case the algorithm is merely an ordering of the variables. That is, the function is evaluated 
by reading values of the variables according to a specified order. This approach is studied 
e.g. in [IZ1IT0], while more complex algorithms are considered in [2] and [3]. 

On a high level our problem is similar. We too are concerned with evaluating a function
(the top-k query) while trying to minimize the overall cost. Especially the problem of
finding a good order in which to read the attribute values, which we discuss in Section 5.2,
is related. This order is important as it can have a considerable effect on the performance
of our approach. It would be of interest to see if any of the previous results [21 [T71 [31 [10]
can be applied in this case, but we consider this to be worth a discussion of its own in
future work.

1.2 Our contributions 

We conclude this section with an outline and summary of the contributions of this paper.

• Section 2: We describe (to the best of our knowledge) a novel top-k search problem.
The main characteristic of the problem is that instead of applying preprocessing
techniques to the items that we are ranking, we have a sample from the same
distribution to be used as training data for machine learning methods.

• Section 4: We propose a simple algorithm that finds a set of k rows from a given
matrix with a high score according to a fixed linear scoring function. The algorithm
uses two parameters. The first parameter is a threshold value that is used to prune
items that are unlikely to belong to the top-k set. The second parameter is an
ordering of the attributes.

• Section 5.1: We propose an algorithm for learning a good value of the threshold
parameter based on training data.

• Section 5.2: We propose an algorithm for learning a good ordering of the attributes
based on training data.

• Section EJ: We conduct a set of experiments to demonstrate the performance of our
algorithm(s). We compare our algorithm to a simple baseline, and to another algorithm
presented previously in [H].



2 Basic definitions 

Input matrix Let $X$ be an $n \times m$ matrix, an element of which is denoted by $X_{ij}$. The
$i$th row of $X$, denoted $X_{i\cdot}$, represents the $i$th item that we are ranking. Let $A_1, \ldots, A_m$ be
a set of $m$ attributes. The values of attribute $A_j$ appear on the $j$th column of $X$, denoted
$X_{\cdot j}$. For the rest of this paper we will assume that $X_{ij} \in \mathbb{R}_0^+$ for all $i$ and $j$. That is, all
entries of $X$ are nonnegative real numbers.



Cost of a query To each attribute $A_j$ is associated a cost $C(A_j)$ that represents the
effort of examining the value $X_{ij}$. We assume that this computation is equally hard for
all cells in $X_{\cdot j}$. The cost of a top-k query is simply the sum of the costs of all entries that
our algorithm has to inspect in order to return its output, normalized by the cost of the
trivial algorithm that computes all entries of $X$. We have thus

$$\mathrm{cost}_k(X) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} C(A_j)\, I\{X_{ij} \text{ is inspected}\}}{n \sum_{j=1}^{m} C(A_j)}, \tag{1}$$

where $I\{S\}$ is 1 if the statement $S$ is true, and 0 otherwise.
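
As a minimal sketch of Equation (1), the normalized cost can be computed from a boolean record of which entries were read. The function name `query_cost` and the `inspected` mask are illustrative assumptions and do not appear in the paper.

```python
def query_cost(inspected, costs):
    """Normalized cost of a query, following Equation (1).

    inspected: list of n rows, each a list of m booleans; inspected[i][j]
               is True if entry X[i][j] was read (hypothetical bookkeeping).
    costs:     list of m attribute costs C(A_j).
    """
    n = len(inspected)
    # Sum the costs of all entries that were actually read.
    paid = sum(c for row in inspected for seen, c in zip(row, costs) if seen)
    # Cost of the trivial algorithm that reads every entry of X.
    full = n * sum(costs)
    return paid / full


# Two rows, three attributes with costs 1, 2 and 3.
mask = [[True, True, True],    # row read completely
        [True, False, False]]  # row skipped after the first attribute
print(query_cost(mask, [1.0, 2.0, 3.0]))
```

A cost of 1 corresponds to reading the whole matrix, so lower values indicate more pruning.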



Scoring and top-k sets Let $w = (w_1, \ldots, w_m)$ be a (row) vector of weights. The prefix
of a vector, denoted $w_{1:h}$, is an $h$-dimensional vector consisting of the elements $w_1, \ldots, w_h$.
Likewise, we denote by $X_{i,1:h}$ the prefix of the $i$th row of $X$. The prefix score of the $i$th
item is given by the product $X_{i,1:h} w_{1:h}^T$. When we have $h = m$, the prefix score is the full
score $X_{i\cdot} w^T$. The exact top-k set of $X$ given $w$, denoted $T_k(X)$, consists of the indices
of the $k$ items with the highest full scores. More formally, we have

$$T_k(X) = \{\, i : |\{ i' \neq i : X_{i'\cdot} w^T > X_{i\cdot} w^T \}| < k \,\}.$$

The schedule All algorithms that we consider in this paper have the common property
that attributes on row $X_{i\cdot}$ are examined sequentially in a certain order, and this order
is the same for all $i$. That is, the entry $X_{ij}$ will be read only if all entries $X_{ij'}$, with
$j' < j$, have already been read. We adopt the terminology used in [T3] and call this
order the schedule. This resembles the order in which a database system would apply
selection predicates in a serial (as opposed to conditional) execution plan. However, in
our case the benefit of using one schedule over another is not associated with selection
efficiency, but with having better estimates of the full score given a prefix score. In Section 5.2
we discuss a number of simple baseline schedules, and also present a method for finding
a good schedule using training data. Different choices for the schedule are compared in
the empirical section.

Accuracy of an approximate result The algorithm we propose in this paper is not
guaranteed to return the exact top-k set. Denote by $\tilde{T}_k(X)$ the $k$ highest scoring items
returned by an inexact top-k algorithm. We report the accuracy of such an approximate
top-k list as the fraction of items in $\tilde{T}_k(X)$ that also belong to the exact set $T_k(X)$. More
formally, we have

$$\mathrm{acc}_k(X) = \frac{|\tilde{T}_k(X) \cap T_k(X)|}{k}. \tag{2}$$



Problem setting The basic objective of this paper is to devise an algorithm that finds 
an approximate top-k set with high accuracy at a low cost. This can be formalized as a
computational problem in a number of ways. The simplest approach is to assume there is 
an external constraint in the form of a budget x on the costs, or a requirement y on the 
accuracy. Then we could devise algorithms that maximize accuracy given that the cost 
can be at most x, or minimize the cost given that the accuracy has to be at least y. The 
approach we take in this paper is more pragmatic, however. We discuss an algorithm that 
uses two parameters, both of which affect accuracy and cost. While we give no analytical 
guarantees about the performance, we develop methods to systematically find good values 
for these parameters, where goodness is measured by using accuracy and cost as defined 
above. 
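
The exact top-k set and the accuracy measure defined in this section can be sketched as follows. The function names are hypothetical, and ties in the full score are broken by row index here, which is an assumption the definitions leave open.

```python
def full_scores(X, w):
    """Full score of every row: the inner product of the row with w."""
    return [sum(x * wj for x, wj in zip(row, w)) for row in X]


def exact_top_k(X, w, k):
    """Indices of the k rows with the highest full scores.

    sorted() is stable, so rows with equal scores keep their index order.
    """
    scores = full_scores(X, w)
    order = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def accuracy(approx, exact):
    """Fraction of the approximate result that belongs to the exact top-k set."""
    return len(set(approx) & set(exact)) / len(exact)


X = [[1.0, 0.0], [0.5, 0.7], [0.0, 2.0], [0.2, 0.1]]
w = [0.5, 0.5]
exact = exact_top_k(X, w, 2)
print(exact, accuracy({2, 0}, exact))
```

Computing the exact set requires reading all of the matrix, so in practice it is only available for evaluation on data where every entry is known.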

3 Baseline algorithms 

We compare the algorithm presented in this paper with two baseline methods. The first
one makes use of a simple branch-and-bound strategy, while the second one is the MPro
algorithm [IJ]. Unlike the proposed algorithm, they do not need training data and can be
applied in a traditional top-k setting. Instead they rely on upper bounds, denoted $U(A_j)$,
for the values of each attribute $A_j$. These can be based either on prior knowledge of the
attribute domains, or alternatively on separate training data. Combined with the prefix
score of the row $X_{i\cdot}$, we can use these to upper bound the full score of $X_{i\cdot}$. More formally,
denote by $U_h(i)$ an upper bound for the full score $X_{i\cdot} w^T$ given the prefix score $X_{i,1:h} w_{1:h}^T$
and the upper bounds for the attributes outside the prefix. We have thus

$$U_h(i) = X_{i,1:h} w_{1:h}^T + \sum_{j=h+1}^{m} w_j\, U(A_j).$$

3.1 Simple upper bounding

A very straightforward approach to our top-k problem is the following: consider the upper
bound $U_h(i)$ for the row $i$ after computing the values in a prefix of length $h$. If this upper
bound is below the full score of the lowest ranking item of the current top-k list, we know
that $X_{i\cdot}$ cannot belong to the final top-k list. Therefore it is not necessary to compute
the remaining values, and we can skip the row.

To apply this heuristic, we first need a candidate top-k set. This we obtain by
reading all values of the first $k$ rows of $X$, and computing their full scores. Denote by $\delta$
the lowest score in the current top-k set. For the remaining rows of $X$, we start computing
the prefix score, and each time a new attribute is added to the prefix, we check the value
of $U_h(i)$. If it is below the current value of $\delta$, we skip the rest of $X_{i\cdot}$; if not, we examine
the value of the next attribute. Once all attributes for a row have been computed, we
know its full score, and can determine whether or not it enters the current top-k list. If it
does, we update $\delta$ accordingly. In the remainder of this paper we call this algorithm the
UB algorithm.

The performance of this method depends on how rapidly $\delta$ reaches a level that leads
to efficient pruning. Obviously, when $\delta$ is small the value of $U_h(i)$ will always be larger.
We can improve the efficiency of the method with the following heuristic. Note that the
value of $A_1$ is always computed for every row. This is because $U_0(i)$ is always larger than
any possible $\delta$, so nothing will be pruned at this point. We can thus compute all values
in the column $X_{\cdot 1}$, and rank the rows of $X$ in decreasing order of this value without sacrificing
anything in the final cost. After sorting, our initial top-k list will contain rows that have
a high value at least in the first attribute. They are thus somewhat more likely to have a
high full score than randomly chosen rows.
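
The UB algorithm described above can be sketched as follows. This is an illustrative reading of the description rather than the authors' implementation: the name `ub_top_k`, the `upper` list holding the bounds $U(A_j)$, and the counter of entries read are assumptions, the weights are assumed nonnegative, and all attribute costs are treated as equal.

```python
def ub_top_k(X, w, upper, k):
    """Sketch of the UB baseline; returns (row indices, entries read)."""
    n, m = len(X), len(w)
    reads = n  # the first column is read for every row before sorting
    # tail[h] bounds the contribution of the attributes not yet read after
    # h reads: the sum over j >= h (0-based) of w[j] * upper[j].
    tail = [0.0] * (m + 1)
    for j in range(m - 1, -1, -1):
        tail[j] = tail[j + 1] + w[j] * upper[j]
    # Process rows in decreasing order of their first attribute.
    order = sorted(range(n), key=lambda i: X[i][0], reverse=True)

    top = {}  # row index -> full score of the current candidates
    for i in order[:k]:
        top[i] = sum(x * wj for x, wj in zip(X[i], w))
        reads += m - 1
    delta = min(top.values())

    for i in order[k:]:
        prefix, h = X[i][0] * w[0], 1
        # Keep reading while the upper bound can still reach delta.
        while h < m and prefix + tail[h] >= delta:
            prefix += X[i][h] * w[h]
            reads += 1
            h += 1
        if h == m and prefix > delta:
            del top[min(top, key=top.get)]  # drop the weakest candidate
            top[i] = prefix
            delta = min(top.values())
    return set(top), reads


X = [[0.9, 0.8, 0.7], [0.8, 0.1, 0.1], [0.2, 0.9, 0.9],
     [0.1, 0.0, 0.1], [0.7, 0.9, 0.9]]
print(ub_top_k(X, [0.5, 0.3, 0.2], [1.0, 1.0, 1.0], 2))
```

Because a row is skipped only when its upper bound is below $\delta$, this sketch never discards a row that belongs to the exact top-k set.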

3.2 The MPro algorithm 

The MPro algorithm of [H] can be seen as the well known A* algorithm j2H p.97ff]
adapted for the top-k query problem. Like the UB algorithm, it computes entries
in $X$ in a left-to-right fashion, i.e., the algorithm does not access $X_{ij}$ unless the value
$X_{ij'}$ has been read for all $j' < j$. For every row $X_{i\cdot}$ the algorithm maintains the upper
bound $U_h(i)$. The rows are stored in a priority queue $Q$ with $U_h(i)$ as the key, i.e.,
the first row in the queue is the one with the highest upper bound. The algorithm pops
rows from $Q$ one by one, computes the next unknown entry, updates the upper bound
and inserts the row back into $Q$, or outputs it as a member of the top-k set if all values
have been computed. When the output size reaches $k$, the algorithm terminates. As with
the UB algorithm, as a first step the value of the attribute $A_1$ is computed for all rows
to compute the initial values of the upper bounds. These are used to initialize $Q$. In the
remainder of this paper, we call this algorithm the MP algorithm.
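
A compact sketch of the priority-queue loop described above is given below. It uses Python's `heapq` module, which implements a min-heap, so the upper bounds are stored negated. The name `mpro_top_k` and the read counter are assumptions made for illustration, and the weights are assumed nonnegative.

```python
import heapq


def mpro_top_k(X, w, upper, k):
    """Sketch of the MP baseline; returns (row indices in output order, entries read)."""
    n, m = len(X), len(w)
    # tail[h] bounds the contribution of attributes h, ..., m-1 (0-based).
    tail = [0.0] * (m + 1)
    for j in range(m - 1, -1, -1):
        tail[j] = tail[j + 1] + w[j] * upper[j]

    # Queue entries: (negated upper bound, row index, prefix length, prefix score).
    heap = []
    for i in range(n):
        prefix = X[i][0] * w[0]
        heap.append((-(prefix + tail[1]), i, 1, prefix))
    heapq.heapify(heap)
    reads = n  # the first attribute is read for every row

    out = []
    while heap and len(out) < k:
        _, i, h, prefix = heapq.heappop(heap)
        if h == m:
            # All values are known, so the bound is the full score, and no
            # other row can exceed it.
            out.append(i)
            continue
        prefix += X[i][h] * w[h]
        reads += 1
        heapq.heappush(heap, (-(prefix + tail[h + 1]), i, h + 1, prefix))
    return out, reads


X = [[0.9, 0.8, 0.7], [0.8, 0.1, 0.1], [0.2, 0.9, 0.9],
     [0.1, 0.0, 0.1], [0.7, 0.9, 0.9]]
print(mpro_top_k(X, [0.5, 0.3, 0.2], [1.0, 1.0, 1.0], 2))
```

Rows are emitted in decreasing order of full score, since a row leaves the queue only when its exact score is at least as large as every remaining upper bound.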

4 An algorithm based on prior knowledge 

In this section we describe a method that finds $k$ high scoring rows of a given matrix $X$
using a fixed scoring vector $w$. A difference to the baseline methods is that the algorithm
requires prior knowledge of the distribution of the values in $X$. In practice this means we
need training data in the form of one or several matrices $X'$ that are drawn from the same
distribution as $X$. The algorithm has two parameters that can be adjusted to tune its
performance. We also provide algorithms for finding good values for these parameters
from training data.

4.1 Algorithm outline 

On a high level the algorithm is based on the same basic principle as the UB algorithm. We 
scan the rows of X one by one and incrementally compute the prefix score for each row. 
This is done until we can discard the remaining entries of the row based on some criterion, 
or until we have computed the full score. If we decide to skip the row based on a prefix 
score, we never return to inspect the remaining entries of the same row. However, unlike 
with the UB or MP algorithms, we are not using simple upper bounds for the remaining 
attributes. Instead we use the training data X' to learn a model that allows us to estimate 
the probability that the current row will enter the current top-k set given the prefix score. 
If this probability is below a given threshold value, we skip the row. 

Suppose that we currently have a candidate set of top-k rows. Denote by $\delta$ the
lowest score in the candidate set, and let $X_{i\cdot}$ be the row that the algorithm is currently
considering. Given a prefix score of $X_{i\cdot}$, we can give an estimate for the full score $X_{i\cdot} w^T$,
and make use of this together with $\delta$ to decide whether or not it is worthwhile to compute
the remaining, still unknown values of $X_{i\cdot}$. More precisely, we want to estimate the
probability that $X_{i\cdot}$ would enter the current top-k set given the prefix score $X_{i,1:h} w_{1:h}^T$,
that is

$$\Pr\left(X_{i\cdot} w^T > \delta \mid X_{i,1:h} w_{1:h}^T\right). \tag{3}$$

If this probability is very small, say, less than 0.001, it is unlikely that $X_{i\cdot}$ will ever enter
the top-k set. In this case we can skip $X_{i\cdot}$ without computing the values of its remaining
attributes. Of course this strategy may lead to errors, as in some cases the prefix score
may give poor estimates of the full score, which in turn causes the probability estimates
to be incorrect. The details of estimating Equation (3) are discussed in Section 4.2.

An outline of the PR algorithm we propose is given in Algorithm 1. It uses a parameter
$\alpha$ that determines when the remaining entries on the row $X_{i\cdot}$ are to be skipped. Whenever
we have $\Pr(X_{i\cdot} w^T > \delta \mid X_{i,1:h} w_{1:h}^T) < \alpha$ we proceed with the next row. Selecting an
appropriate value of $\alpha$ is discussed in Section 5.1. As with the baseline algorithms, we
also need an order, the schedule, in which to process the attributes. This is the second
parameter of our algorithm. In Section 5.2 we describe a number of simple baseline
schedules, and also propose a method that uses training data to learn a good schedule for
the PR algorithm.
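
The outline of the PR algorithm can be sketched as follows. The probability estimate is passed in as a callable `prob_enter(h, prefix, delta)`, a hypothetical interface chosen for this sketch; the function name `pr_top_k` and the read counter are assumptions as well. The example at the end plugs in a toy estimator that returns 1 whenever a simple upper bound can still reach $\delta$, which is only a stand-in for the regression-based estimate of Section 4.2.

```python
def pr_top_k(X, w, k, alpha, prob_enter):
    """Sketch of the PR algorithm; returns (row indices, entries read).

    prob_enter(h, prefix, delta) should estimate the probability that the
    full score exceeds delta given a prefix score over the first h attributes.
    """
    n, m = len(X), len(w)
    reads = n  # the first column is read for every row before sorting
    order = sorted(range(n), key=lambda i: X[i][0], reverse=True)

    top = {}  # row index -> full score of the current candidates
    for i in order[:k]:
        top[i] = sum(x * wj for x, wj in zip(X[i], w))
        reads += m - 1
    delta = min(top.values())

    for i in order[k:]:
        prefix, h = X[i][0] * w[0], 1
        # Read further attributes only while the row still looks promising.
        while h < m and prob_enter(h, prefix, delta) > alpha:
            prefix += X[i][h] * w[h]
            reads += 1
            h += 1
        if h == m and prefix > delta:
            del top[min(top, key=top.get)]  # drop the weakest candidate
            top[i] = prefix
            delta = min(top.values())
    return set(top), reads


X = [[0.9, 0.8, 0.7], [0.8, 0.1, 0.1], [0.2, 0.9, 0.9],
     [0.1, 0.0, 0.1], [0.7, 0.9, 0.9]]
w = [0.5, 0.3, 0.2]
tail = [1.0, 0.5, 0.2, 0.0]  # remaining weight mass when all values are at most 1


def toy(h, prefix, delta):
    # Toy stand-in: certain if the upper bound reaches delta, impossible otherwise.
    return 1.0 if prefix + tail[h] >= delta else 0.0


print(pr_top_k(X, w, 2, 0.5, toy))
```

With a real probability estimate, raising `alpha` prunes rows earlier and lowers the number of entries read, at the risk of skipping rows that belong to the exact top-k set.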

4.2 Estimating the probabilities 

The most crucial part of our algorithm is the method for estimating the probability
$\Pr(X_{i\cdot} w^T > \delta \mid X_{i,1:h} w_{1:h}^T)$. In short, the basic idea is to estimate the distribution of
$X_{i\cdot} w^T$ given the prefix score $X_{i,1:h} w_{1:h}^T$. We do this by learning regression models that
predict the parameters of this distribution as a function of the prefix score. Together with
$\delta$, the desired probability can then be read from this distribution. The details of this are
discussed next.

Algorithm 1 (The PR algorithm)
Input: the $n \times m$ matrix $X$, parameter $\alpha \in [0, 1]$
Output: an approximate top-k set
 1: Compute all values in column $X_{\cdot 1}$ and sort the rows of $X$ in decreasing order of this value.
 2: $C_k \leftarrow \{X_{1\cdot}, \ldots, X_{k\cdot}\}$
 3: $\delta \leftarrow \min_{x \in C_k} \{x w^T\}$
 4: for $i = k + 1$ to $n$ do
 5:     $h \leftarrow 1$
 6:     while $h < m$ and $\Pr(X_{i\cdot} w^T > \delta \mid X_{i,1:h} w_{1:h}^T) > \alpha$ do
 7:         $h \leftarrow h + 1$
 8:         Compute the value $X_{ih}$.
 9:     end while
10:     if $h = m$ and $X_{i\cdot} w^T > \delta$ then
11:         $C_k \leftarrow C_k \setminus \arg\min_{x \in C_k} \{x w^T\}$
12:         $C_k \leftarrow C_k \cup \{X_{i\cdot}\}$
13:         $\delta \leftarrow \min_{x \in C_k} \{x w^T\}$
14:     end if
15: end for
16: return $C_k$

The basic assumption of this paper is that the distribution of the full score $X_{i\cdot} w^T$
given a fixed prefix score is Gaussian. We acknowledge that this may not be true in
general. However, according to the central limit theorem, as the number of attributes
increases, their sum approaches a normal distribution as long as they are independent.
(The attributes need not follow the same distribution as long as they are bounded, see the
Lindeberg theorem [3 page 254].) Of course the attributes may not be independent, and
also their number may not be large enough to fully warrant this argument in practice.
Nonetheless, we consider this a reasonable first step.

By convention, we denote the parameters of the normal distribution by $\mu$ and $\sigma$,
where $\mu$ is the mean and $\sigma$ the standard deviation. Furthermore, we assume that both $\mu$
and $\sigma$ depend on the prefix score, and we must account for prefixes of different lengths.
Denote by $s_h$ a prefix score that is based on the first $h$ attributes. The assumption is that
$X_{i\cdot} w^T \sim N(\mu(s_h), \sigma(s_h))$. Once we have some estimates for $\mu(s_h)$ and $\sigma(s_h)$, we simply
look at the tail of the distribution and read the probability of $X_{i\cdot} w^T$ being larger than a
given $\delta$. To learn the functions $\mu(s_h)$ and $\sigma(s_h)$ we use training data. For every possible
prefix length $h$, we associate the prefix score of the row $X_{i\cdot}$ with the full score of $X_{i\cdot}$. That
is, our training data consists of the following set of ("prefix score", "full score") pairs for
every $h$:

$$X_h = \left\{ \left( X_{i,1:h} w_{1:h}^T,\; X_{i\cdot} w^T \right) \right\}_{i=1}^{n}. \tag{4}$$

Now we have to estimate $\mu(s_h)$ and $\sigma(s_h)$. One approach is to use binning. Given $s_h$
and $X_h$, we could compute the set

$$B(s_h) = \{\, b \mid a \in \mathrm{Bin}(s_h) \wedge (a, b) \in X_h \,\}$$

that contains the full scores of objects that have a prefix score belonging to the same bin as
$s_h$. The bins are computed in advance by some suitable technique. Now we can define
$\mu(s_h)$ and $\sigma(s_h)$ simply as their standard estimates in $B(s_h)$. This approach has some
drawbacks, however. First, we need to store the sets $X_h$ for every $h$. This might be a
problem if $n$ and $m$ are very large. If, on the other hand, $n$ is small, we either have to use large bins,
which leads to the estimates being only coarsely connected to $s_h$, or use narrow bins with
only a few examples in each, which will also degrade the quality of the estimators.

To remedy this we use an approach based on kernel smoothing [28]. Instead of fixed
bins, we consider all of $X_h$ when computing an estimate of $\mu(s_h)$ or $\sigma(s_h)$. The idea is
that a pair $(a, b) \in X_h$ contributes to the estimates with a weight that depends on the
distance between the prefix scores $a$ and $s_h$. The pair contributes a lot if $a$ is close to $s_h$,
and only a little (if at all) if the distance is large. Denote by $K : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ a kernel
function. For the rest of this paper we let

[Equation (5): $K(x, y) = e^{\ldots}$, with an exponent depending on $x$, $y$ and $\beta$; not fully legible in the source]

where $\beta$ is a parameter. Other alternatives could be considered as well; the proposed
method is oblivious to the choice of the kernel function.

Using $K$ we can define the kernel weighted estimates for $\mu(s_h)$ and $\sigma(s_h)$. We let

$$\mu(s_h) = \frac{\sum_{(a,b) \in X_h} K(a, s_h)\, b}{\sum_{(a,b) \in X_h} K(a, s_h)}, \tag{6}$$

that is, any full score $X_{i\cdot} w^T$ contributes to $\mu(s_h)$ with the weight $K(X_{i,1:h} w_{1:h}^T, s_h)$. The
nice property of this approach is that it can also be used to estimate the standard deviation
of the full score at $s_h$ by letting

$$\sigma(s_h) = \sqrt{\frac{\sum_{(a,b) \in X_h} K(a, s_h)\, b^2}{\sum_{(a,b) \in X_h} K(a, s_h)} - \mu(s_h)^2}. \tag{7}$$

The above equation is a simple variation of the basic formula $\mathrm{Var}[X] = E[X^2] - E[X]^2$,
where the kernel function is taken into account.

One problem associated with kernel smoothing techniques in general is the width of
the kernel, which in this case is defined by the parameter $\beta$. Small values of $\beta$ have the effect
that the prefix score $a$ of a pair $(a, b) \in X_h$ must be very close to $s_h$ for the full score $b$
to contribute anything to the final estimates. Larger values have the opposite effect: even
points that are far away from $s_h$ will influence the estimates. Selecting an appropriate
width for the kernel is not trivial. We observed that setting $\beta$ to one fifth of the standard
deviation of the prefix scores for $h$ gives good results in practice.

While this technique lets us avoid some of the problems related to the binning approach,
it comes at a fairly high computational cost. We have to evaluate the kernel $n$
times to get estimates for $\mu(s_h)$ and $\sigma(s_h)$ for one $s_h$. These estimates must be computed
potentially for every possible prefix of every row in $X$. This results in $O(n^2 m)$ calls to
$K(x, y)$ for one single query (assuming both the training data and the input matrix have
$n$ rows), which clearly does not scale. Hence, we introduce approximate estimators for
$\mu(s_h)$ and $\sigma(s_h)$ that are based on simple linear regression models. This way we do not
need to evaluate $K(x, y)$ at query time at all. We let

$$\hat{\mu}(s_h) = q_1^{\mu} s_h + q_0^{\mu}, \tag{8}$$

and

$$\hat{\sigma}(s_h) = q_1^{\sigma} s_h + q_0^{\sigma}. \tag{9}$$

The parameters $q_1^{\mu}$, $q_0^{\mu}$, $q_1^{\sigma}$, and $q_0^{\sigma}$ are the standard estimates for linear regression
coefficients given the sets

T M = {(X i , 1:ft Wi :ft , M(X<,ls/ l Wi !h )}^ 1 , (10) 

and 

T CT = {(X, l:/ .w l:/ .. a(X iil:/l w 1:/l )}7 =1 , (11) 

where ju(Xi ) i : / l Wi : / l ) and cr(Xj ) i : / l Wi : ^) are based on equations [6] and El respectively. We 
thus compute the kernel estimates only for the training data. Given T M and T a we learn 
linear functions that are used at query time to estimate the parameters of the normal 
distribution that we assume the full scores are following. 

Our method for estimating the probability $\Pr(x w^T > \delta \mid x_{1:h} w_{1:h}^T)$
can be summarized as follows:

1. Given training data (a matrix $X$ with all entries known), compute for each row
the full score, and associate this with the prefix scores for each possible prefix length
$h$. That is, for each $h$ compute the set $S_h$ as defined in Equation (4).

2. Using the definitions for $\mu(s_h)$ and $\sigma(s_h)$ given in equations (6) and (7),
compute the sets $T_\mu$ and $T_\sigma$ defined in equations (10) and (11), respectively.

3. Learn the models in equations (8) and (9) by fitting a regression line to the points in
$T_\mu$ and $T_\sigma$, respectively.

4. At query time, use the cumulative distribution function of
$N(\mu(X_{i,1:h} w_{1:h}^T), \sigma(X_{i,1:h} w_{1:h}^T))$
to estimate the probability of $X_{i\cdot} w^T$ being larger than $\delta$.
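
Steps 3 and 4 above can be illustrated with a short Python sketch. It assumes ordinary
least squares for the regression lines and a normal model for the full score; the names
`fit_line` and `prob_exceeds`, the handling of a non-positive $\sigma$, and the toy
training pairs are all assumptions made for illustration and do not come from the paper.

```python
import math

def fit_line(points):
    """Ordinary least-squares fit y ~ slope * x + intercept for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def prob_exceeds(prefix_score, delta, mu_model, sigma_model):
    """Estimate Pr(full score > delta | prefix score) under a normal model."""
    mu = mu_model[0] * prefix_score + mu_model[1]
    sigma = sigma_model[0] * prefix_score + sigma_model[1]
    if sigma <= 0.0:
        # Assumption of this sketch: treat the full score as known exactly.
        return 1.0 if mu > delta else 0.0
    z = (delta - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 - math.erf(z))  # equals 1 - Phi((delta - mu) / sigma)

# Made-up stand-ins for T_mu and T_sigma at one prefix length h.
t_mu = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
t_sigma = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
mu_model = fit_line(t_mu)
sigma_model = fit_line(t_sigma)
```

With these toy pairs a prefix score of 1 predicts a mean full score of 3, so
`prob_exceeds(1.0, 3.0, mu_model, sigma_model)` returns one half, and the row would be
pruned for any $\alpha$ above that value.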

5 Parameter selection 

In this section we discuss systematic methods for choosing the parameters required by 
the algorithm presented above. 

5.1 Choosing the right $\alpha$

We start by describing a method for learning an "optimal" value of $\alpha$ given training
data $X$. This can be very useful, since setting the value of $\alpha$ too low will decrease
the performance of Algorithm 1 in terms of the cost. When $\alpha$ increases, the algorithm
will clearly prune more items. This leads both to a lower cost and a lower accuracy.
Conversely, when $\alpha$ decreases, the accuracy of the method increases, and so does the
cost as fewer items are being pruned. The definitions of accuracy and cost in equations
(2) and (1), respectively, thus depend on $\alpha$. We denote by $\mathrm{acc}_k(X, \alpha)$
and $\mathrm{cost}_k(X, \alpha)$ the accuracy and cost attained by the PR algorithm for a
given value of $\alpha$.

Due to the trade-off between cost and accuracy, we should set $\alpha$ as high (or low) as
possible without sacrificing too much in accuracy (or cost). While a very conservative
estimate for $\alpha$, say 0.001, is quite likely to result in a high accuracy, it can
perform suboptimally in terms of the cost. Maybe with $\alpha = 0.05$ we obtain an almost
equally high accuracy at only a fraction of the cost.

Consider a coordinate system where we have accuracy on the x-axis and cost on the
y-axis. In an ideal setting we would have an accuracy of 1 at zero cost, represented by
the point at $(1, 0)$ on this accuracy-cost plane. Obviously this is not attainable in
reality, since we always have to inspect some of the entries of $X$, and this will lead to
a nonzero cost. But we can still define the optimal $\alpha$ in terms of this point.

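Definition 1 Let

$$\mathrm{dist}_k(X, \alpha) = \lVert (\mathrm{acc}_k(X, \alpha), \mathrm{cost}_k(X, \alpha)) - (1, 0) \rVert.$$

The optimal $\alpha^*$ given the matrix $X$ satisfies

$$\alpha^* = \arg\min_{\alpha \in [0,1]} \mathrm{dist}_k(X, \alpha),$$

where $\lVert \cdot \rVert$ denotes the Euclidean norm.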

That is, we want to find an $\alpha$ that minimizes the distance to the point $(1, 0)$ on
the accuracy-cost plane. Clearly this is a rather simple definition. It assigns equal
weight to accuracy and cost, even though we might prefer one over the other, depending on
the application. However, modifying the definition to take such requirements into account
is easy.

Next we discuss how to find $\alpha^*$. In the definition we state that it has to belong to
the interval $[0, 1]$. However, first we observe that there exists an interval
$[\alpha_{\min}, \alpha_{\max}]$, so that when $\alpha < \alpha_{\min}$ we have
$\mathrm{acc}_k(X, \alpha) = 1$, and when $\alpha > \alpha_{\max}$ we have
$\mathrm{acc}_k(X, \alpha) = 0$. Clearly the interesting $\alpha$ in terms of Definition 1
lies in $[\alpha_{\min}, \alpha_{\max}]$. We can analyze the values in this interval even
further. Consider the following set of possible values for $\alpha$:

$$Q(X) = \{\min_h \Pr(x w^T > \delta \mid x_{1:h} w_{1:h}^T)\}_{x \in T_k(X)}, \tag{12}$$

where $\delta = \min_{x \in T_k(X)} \{x w^T\}$. That is, for each $x \in T_k(X)$, $Q(X)$
contains the value $\alpha'$ so that when $\alpha > \alpha'$, Algorithm 1 will prune $x$.
More precisely, if we order the values in $Q(X)$ in ascending order, and let $\alpha_i$
denote the $i$th value in this order, we know that when $\alpha \in [\alpha_i, \alpha_{i+1})$
the algorithm will prune exactly $i$ rows of the correct top-$k$ set of $X$. (Assuming
that all $\alpha_i$ are different.) By letting $\alpha$ vary from $\alpha_1 = \alpha_{\min}$
to $\alpha_k < \alpha_{\max}$, $\mathrm{acc}_k(X, \alpha)$ decreases from 1 to $1/k$ in
steps of $1/k$. Likewise, $\mathrm{cost}_k(X, \alpha)$ decreases as $\alpha$ increases. Now
we can systematically express $\mathrm{cost}_k(X, \alpha)$ as a function of
$\mathrm{acc}_k(X, \alpha)$, since each $\alpha \in Q(X)$ is associated with a certain
accuracy.

This makes finding the optimal $\alpha$ easy. We solve the optimization problem of
Definition 1 by only considering values in $Q(X)$. In fact, we can show that an $\alpha^*$
obtained this way is the same as the one we would obtain by having the interval $[0, 1]$
as the feasible region.
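
As a minimal sketch of this search, the Python fragment below picks the candidate
threshold whose (accuracy, cost) pair lies closest to the ideal point $(1, 0)$. The
function names, the `evaluate` callback, and the three measurements are hypothetical
and serve only to illustrate Definition 1 restricted to a finite candidate set.

```python
import math

def dist_to_ideal(acc, cost):
    """Euclidean distance from the point (acc, cost) to the ideal point (1, 0)."""
    return math.hypot(acc - 1.0, cost)

def best_alpha(candidates, evaluate):
    """Return the candidate alpha with the smallest distance to (1, 0).

    `evaluate(alpha)` is a hypothetical callback that returns the pair
    (accuracy, cost) measured on training data for that alpha.
    """
    return min(candidates, key=lambda a: dist_to_ideal(*evaluate(a)))

# Made-up measurements for three threshold values taken from Q(X).
measured = {0.01: (1.0, 0.60), 0.05: (0.9, 0.30), 0.20: (0.5, 0.10)}
alpha_star = best_alpha(sorted(measured), measured.get)
```

Here the middle candidate wins: it gives up a little accuracy but halves the cost, so
its distance to $(1, 0)$ is the smallest of the three.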

Lemma 1 Let $\alpha^* = \arg\min_{\alpha \in [0,1]} \mathrm{dist}_k(X, \alpha)$. We have
$\alpha^* \in Q(X)$, where $Q(X)$ is defined as in Equation (12).

Proof We show that for all $\alpha$ that lie between any two adjacent values in $Q(X)$, the
distance $\mathrm{dist}_k(X, \alpha)$ is larger than when $\alpha$ is chosen from $Q(X)$.
Consider any $\alpha_i$ and $\alpha_{i+1}$ in $Q(X)$. We show that within the interval
$[\alpha_i, \alpha_{i+1}]$ the distance $\mathrm{dist}_k(X, \alpha)$ is minimized for either
$\alpha = \alpha_i$ or $\alpha = \alpha_{i+1}$. As $\alpha$ increases from $\alpha_i$ to
$\alpha_i + \epsilon$ for some small $\epsilon > 0$, $\mathrm{acc}_k(X, \alpha)$ decreases
by $1/k$, and $\mathrm{dist}_k(X, \alpha)$ increases by
$(\mathrm{dist}_k(X, \alpha_i + \epsilon) - \mathrm{dist}_k(X, \alpha_i)) = \Delta_1 > 0$.
When we further increase $\alpha$ from $\alpha_i + \epsilon$ to $\alpha_{i+1}$,
$\mathrm{acc}_k(X, \alpha)$ stays the same, but $\mathrm{cost}_k(X, \alpha)$ may decrease.
Therefore, $\mathrm{dist}_k(X, \alpha)$ decreases until $\alpha = \alpha_{i+1}$. We let
$(\mathrm{dist}_k(X, \alpha_i + \epsilon) - \mathrm{dist}_k(X, \alpha_{i+1})) = \Delta_2 > 0$.
If $\Delta_1 > \Delta_2$, we have
$\mathrm{dist}_k(X, \alpha_i) < \mathrm{dist}_k(X, \alpha_{i+1})$, otherwise
$\mathrm{dist}_k(X, \alpha_i) \geq \mathrm{dist}_k(X, \alpha_{i+1})$.

5.2 Choosing a schedule

So far we have not considered the order, the schedule, in which the columns of $X$ should
be processed. This order has a considerable impact on the performance of the algorithms.
Processing the attributes in a certain order will lead to a tighter upper bound on the
full score in case of the UB and MP algorithms. With the PR algorithm the probability
estimates will be more accurate with some permutations of the attributes than others. A
similar problem was considered in [13] for the MP algorithm. The approach is different,
however, as training data is not used and an optimal schedule must be found at query
time.

5.2.1 Baseline schedules 

Given the ranking vector $w$, and the cost $C(A_i)$ for each attribute $A_i$, we consider
four simple baselines for the schedule:

A: Read the attributes in random order. This is the simplest possible way of choosing
a schedule. We pick a random total order of $m$ items uniformly from the set of all
permutations and use this as the schedule.

B: Read the attributes in decreasing order of the absolute values in $w$. This can be
motivated by the fact that attributes with a larger weight (the important attributes)
will have a bigger impact on the full score $w^T X_{j\cdot}$. In some cases we might have a
fairly accurate estimate of $w^T X_{j\cdot}$ already after a very short prefix of the row
$X_{j\cdot}$ has been computed. This in turn will lead to better pruning, since the
estimates of the probability $\Pr(w^T X_{j\cdot} > \delta \mid w_{1:h}^T X_{j,1:h})$ are
more accurate. The downside of this approach is that the costs are not taken into account.
It is possible that the important attributes have almost the same absolute value in $w$,
but considerably different costs.

C: Read the attributes in increasing order of the cost $C(A_i)$. This is based on the
assumption that by computing the "cheap" features first, we might be able to prune
objects without having to look at the expensive attributes at all. However, this time
we may end up computing a long prefix of $X_{j\cdot}$, because it is possible that some of
the "cheap" attributes have a low weight in $w$, and thereby do not contribute much
to the full score.

D: Read the attributes in decreasing order of the ratio $|w_i| / C(A_i)$. By this we try to
remedy the downsides of the previous two approaches. The value of an attribute is
high if it has a large weight in $w$ and a small cost. Conversely, attributes with a
small weight and a high cost are obviously less useful.
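
The four baselines are simple enough to state directly in code. The following Python
sketch returns each schedule as a list of attribute indices; the function name, the
`seed` parameter, and the example weights and costs are illustrative assumptions, and the
costs are assumed to be strictly positive so that the ratio in D is defined.

```python
import random

def baseline_schedules(weights, costs, seed=0):
    """Return the baseline orderings A-D as lists of attribute indices."""
    idx = list(range(len(weights)))
    a = idx[:]
    random.Random(seed).shuffle(a)                               # A: random order
    b = sorted(idx, key=lambda i: -abs(weights[i]))              # B: decreasing |w_i|
    c = sorted(idx, key=lambda i: costs[i])                      # C: increasing C(A_i)
    d = sorted(idx, key=lambda i: -abs(weights[i]) / costs[i])   # D: decreasing |w_i| / C(A_i)
    return {"A": a, "B": b, "C": c, "D": d}

# Made-up example with three attributes.
schedules = baseline_schedules(weights=[0.9, 0.1, 0.5], costs=[3.0, 1.0, 0.5])
```

In this example attribute 0 has the largest weight but is also the most expensive, so B
reads it first, C reads it last, and D places it between the other two.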

5.2.2 Learning a schedule from training data 

In addition to the baselines above, we can also try to find a schedule by using available
information. In general we want to find a schedule that minimizes the cost of finding the
top-$k$ set in the training data. One difficulty here is the selection of $\alpha$. The cost
of a given schedule $\psi$ depends on the value of $\alpha$, and the optimal schedule might
be different for different values of $\alpha$. One option would be to fix $\alpha$ in
advance. However, we want to avoid this, because the $\alpha$ we use for finding the
schedule might be different from the $\alpha$ that is used when running Algorithm 1. (After
learning the schedule $\psi$, we can use the method described in Section 5.1 to find an
optimal value of $\alpha$ given $\psi$.) Another option would be to simultaneously learn an
optimal schedule $\psi$ and the optimal $\alpha$. This does not seem trivial, however.
Instead, we take an approach where we try to find a schedule that is good independent of
the final choice of $\alpha$.

If $\alpha$ were fixed, we could define a cost for the schedule $\psi$ in terms of
$\mathrm{acc}_k(X, \alpha)$ and $\mathrm{cost}_k(X, \alpha)$. However, instead of considering
a particular value of $\alpha$, we define the cost as a sum over all possible meaningful
values of $\alpha$. Recall that the set $Q(X)$ (see Equation (12)) contains all "threshold"
values so that when $\alpha$ crosses these, $\mathrm{acc}_k(X, \alpha)$ decreases by $1/k$.
We define the cost of the schedule $\psi$ given $X$ as

$$\mathrm{cost}_k(X, \psi) = \sum_{\alpha \in Q(X(\psi))} \mathrm{cost}_k(X(\psi), \alpha), \tag{13}$$

where $X(\psi)$ denotes the matrix $X$ with permutation $\psi$ applied to its columns. Note
that $\mathrm{cost}_k(X, \psi)$ can be interpreted as the "area" below the curve of
$\mathrm{cost}_k(X(\psi), \alpha)$ in the accuracy-cost plane for $\alpha \in Q(X)$. For
example, if the curve corresponding to permutation $\psi$ is below the curve corresponding
to $\psi' \neq \psi$, we know that independent of $\alpha$, the schedule $\psi$ always has a
smaller cost for the same value of $\mathrm{acc}_k(X(\psi), \alpha)$. The score in Equation
(13) is a heuristic that attempts to capture this intuition. The scheduling problem can
thus be expressed as follows: Given an integer $k$ and the matrix $X$, find the schedule

$$\psi^* = \arg\min_{\psi} \{\mathrm{cost}_k(X, \psi)\},$$

where $\mathrm{cost}_k(X, \psi)$ is defined as in Equation (13). In this paper we propose a
simple greedy heuristic for learning a good schedule. Denote by a partial schedule a
prefix of a full schedule. The algorithm works by adding a new attribute to an already
existing partial schedule. The attribute that is added is the best one among all possible
alternatives.

Since we are dealing with partial schedules that are prefixes of a full schedule, we
cannot evaluate $\mathrm{cost}_k(X, \psi)$ exactly as defined above. This is because some
rows are not pruned by looking only at their prefix. However, they may be pruned at some
later stage given a longer prefix. When evaluating a partial schedule, we assume that any
row that is not pruned incurs the full cost. That is, we must read all of its attributes
before knowing whether or not it belongs to the top-$k$ set. This means that the cost of a
prefix of the final schedule is an upper bound for the cost of the full schedule. More
formally, we denote the upper bound by $\lceil \mathrm{cost}(X, \psi) \rceil$, and let

$$\lceil \mathrm{cost}(X, \psi) \rceil = \sum_{\alpha \in Q(X(\psi))} \sum_{j=1}^{n} \mathrm{cost}(X_{j\cdot}, \psi, \alpha),$$

where the row-specific cost is

$$\mathrm{cost}(x, \psi, \alpha) = \begin{cases} \sum_{j=1}^{m} C(A_j) & \text{if } J(x, \psi, \alpha) = 0, \\ \sum_{j=1}^{J(x, \psi, \alpha)} C(A_{\psi(j)}) & \text{otherwise.} \end{cases} \tag{14}$$

Above $J(x, \psi, \alpha)$ is the index of the first attribute (according to $\psi$) that
will prune the remaining attributes of $x$, that is,

$$J(x, \psi, \alpha) = \min\{h \mid \Pr(x w^T > \delta \mid x_{\psi(1:h)} w_{\psi(1:h)}^T) < \alpha\}.$$

Algorithm 2

Input: the matrix $X$, the set of attributes $\mathcal{A}$
Output: a permutation $\psi$ of $\mathcal{A}$

$\psi \leftarrow \emptyset$
while $\mathcal{A} \neq \emptyset$ do
    $A' \leftarrow \arg\min_{A \in \mathcal{A}} \lceil \mathrm{cost}([\psi A], X) \rceil$
    $\psi \leftarrow [\psi A']$
    $\mathcal{A} \leftarrow \mathcal{A} \setminus \{A'\}$
end while
return $\psi$



For convenience we let $\min\{\emptyset\} = 0$. Equation (14) simply states that the cost
of a row is the sum of all attribute costs if the row is not pruned, otherwise we only pay
for the attributes that are required to prune the row. The scheduling algorithm, shown in
Algorithm 2, always appends the attribute to the prefix that minimizes the upper bound
$\lceil \mathrm{cost}(X, \psi) \rceil$. We denote by $[\psi A]$ the permutation $\psi$
appended with $A$.
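
The greedy procedure can be sketched in Python as follows. This is a simplified
illustration under stated assumptions, not the exact procedure of the paper: the
probability estimator is passed in as a hypothetical callback `prob(row, attrs)`, and the
set of thresholds is a fixed list `alphas` rather than the schedule-dependent set
$Q(X(\psi))$. All function names and the toy data are invented for the example.

```python
def row_cost(row, prefix, costs, alpha, prob):
    """Cost of one row under a (partial) schedule `prefix`, in the spirit of Equation (14).

    `prob(row, attrs)` is a hypothetical estimator of the probability that the
    full score of `row` exceeds the threshold, given the attributes in `attrs`.
    """
    paid = 0.0
    for h, attr in enumerate(prefix, start=1):
        paid += costs[attr]
        if prob(row, prefix[:h]) < alpha:
            return paid          # pruned after reading h attributes
    return sum(costs)            # not pruned within the prefix: charge the full cost

def upper_bound(rows, prefix, costs, alphas, prob):
    """Sum of the row costs over all rows and all thresholds in `alphas`."""
    return sum(row_cost(r, prefix, costs, a, prob) for a in alphas for r in rows)

def greedy_schedule(rows, costs, alphas, prob):
    """Repeatedly append the attribute that minimises the upper bound."""
    remaining = list(range(len(costs)))
    schedule = []
    while remaining:
        best = min(remaining, key=lambda attr: upper_bound(
            rows, schedule + [attr], costs, alphas, prob))
        schedule.append(best)
        remaining.remove(best)
    return schedule

def toy_prob(row, attrs):
    # Made-up estimator: seeing attribute 1 is enough to prune any row.
    return 0.0 if 1 in attrs else 1.0

toy_rows = [[0.2, 0.1], [0.4, 0.3]]
toy_schedule = greedy_schedule(toy_rows, costs=[5.0, 1.0], alphas=[0.5], prob=toy_prob)
```

In the toy run the cheap attribute 1 prunes every row on its own, so the greedy step
places it first and the expensive attribute 0 is never read.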

6 Experiments 

In the experiments that follow we compare the performance of our proposed method with
the baseline algorithms using different schedules. Our basic criteria for evaluation are the
cost and accuracy measures. We wish to remind the reader that our notion of accuracy is
not a measure of relevance, but simply a comparison with the exact top-$k$ set. In addition
to the baselines described earlier, it is good to compare the numbers with a sampling
approach, where we randomly select, say, 50 percent of the rows of the matrix, and run
the trivial algorithm on this. This will have a cost of 0.5, and the expected accuracy
will also be 0.5. Any reasonable algorithm should outperform this.
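
The figures for the sampling approach follow from linearity of expectation. As a short
derivation, suppose each row ends up in the sample $S$ with probability $p$ (the symbols
$p$ and $S$ are introduced here only for illustration). A row of the exact top-$k$ set
$T_k(X)$ that lands in $S$ is necessarily among the $k$ best rows of $S$, so the expected
accuracy is

```latex
\mathbb{E}[\mathrm{acc}]
  = \frac{1}{k} \sum_{x \in T_k(X)} \Pr(x \in S)
  = \frac{1}{k} \cdot k \cdot p
  = p .
```

Reading every attribute of the sampled rows costs, in expectation, a fraction $p$ of the
full cost, which for $p = 0.5$ gives the cost and accuracy of 0.5 quoted above.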

The upper bounds for attribute values used by the UB and MP algorithms are based
on training data as well. The upper bound for attribute $A_j$ is the largest value of $A_j$
observed in the training data. We acknowledge that this is a rather rudimentary approach,
but we want to study how these algorithms perform under the same conditions as the
PR algorithm. In each of the tables that follow, the numbers in parentheses denote the
standard deviation of the corresponding quantity.

6.1 Datasets 

We conduct experiments on both artificial and real data. Random data is generated by
sampling each $X_{ij}$ from a normal distribution with zero mean and unit variance. To
enforce that $X_{ij} \in \mathbb{R}_0^+$ we replace each entry with its absolute value. In
every experiment we use one random $X$ as the training data, and another random $X$ as the
test data. The results are averages over a number of such training-testing pairs. Also,
the vector $w$ and the costs $C(A_j)$ are chosen uniformly at random from the interval
$[0, 1]$.
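
A generator for such random inputs might look as follows in Python. The function name,
its parameters, and the chosen sizes are assumptions made for illustration; the paper
does not state the dimensions in this passage. Note that `random.random()` samples from
the half-open interval $[0, 1)$.

```python
import random

def random_instance(n, m, seed=None):
    """Return (X, w, costs): X is n x m with entries |N(0, 1)|, w and costs are uniform."""
    rng = random.Random(seed)
    X = [[abs(rng.gauss(0.0, 1.0)) for _ in range(m)] for _ in range(n)]
    w = [rng.random() for _ in range(m)]        # ranking vector
    costs = [rng.random() for _ in range(m)]    # attribute costs C(A_j)
    return X, w, costs

# Hypothetical sizes: 100 rows and 5 attributes.
X_train, w, costs = random_instance(n=100, m=5, seed=1)
```

A test matrix from the same distribution can be drawn by calling the function again
with a different seed and keeping only the returned matrix.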

The real data consists of a set of queries from a context sensitive Wikipedia search
engine [27]. For each query $q$ we have the matrix $X^q$ where each row corresponds to a
document that contains the query term(s). The documents are represented by 7 features.
We split the data randomly into a training and a test part. The training data consists of
25 matrices, each corresponding to a set of documents matching a different query. The test
data consists of 100 matrices, each again corresponding to a different query. (There is no
overlap between queries in the training data and the test set.) The training part is used
to learn the weight vector $w$ as described in [27]. The algorithms for finding a good
schedule and optimizing the value of $\alpha$ are also run on the training data.

Table 1: Estimated attribute costs and their scoring weights for the Wikipedia data.

            BM25    GPR     TXT     SUCC    PRED    SPOT    LPR
$C(A_j)$    1.43    2.23    10.02   5.49    4.06    5.42    1.72
$w_j$       0.047   0.003   0.636   0.479   0.353   0.008   0.588

The attribute costs $C(A_j)$ for the Wikipedia example were measured by computing
features for 400 queries. For each query the result set is restricted to the 1000 topmost
documents according to one of the features (BM25). In every case we measure the time
spent computing each feature. The costs shown in Table 1 are logarithms of the averages of
these. The numbers are not intended to be fully realistic, but we consider them reasonable
for the purposes of this paper.

6.2 Schedule comparison 

First we compare the different schedule selection heuristics. With MP and UB we only use
the baseline schedules A, B, C, and D. In case of the PR algorithm we also study how
a schedule learned using the method described in Section 5.2 compares to the baselines.
With the PR algorithm we use the method described in Section 5.1 to learn a good value
of $\alpha$. We also study the effect of the heuristic described in Section 3.1. That is,
do we gain anything by reordering the rows of $X$ in decreasing order of the value of the
first attribute in the schedule before running the algorithms? Note that this affects only
the UB and PR algorithms. The MP algorithm has this heuristic built in, as the next element
of $X$ it reads is selected from a priority queue that is initialized with the upper bounds
based on only the first feature.

The upper part of Table 2 shows the average cost for each algorithm and schedule for
$k = 10$ over 50 random inputs when the row reordering heuristic is in use. As can be
seen, the PR algorithm outperforms both UB and MP by a clear margin independent of
the choice of the schedule. When comparing the schedules, both D (the weight-cost ratio
heuristic) and a learned schedule outperform the others. The difference between D and a
learned schedule is very small. The bottom part of Table 2 shows the same quantities for
the UB and PR algorithms when the rows of the input matrix are not sorted in decreasing
order of the value of the first attribute (according to the schedule used). Clearly both
algorithms perform considerably worse in this case. Hence, with random data the row
reordering heuristic is useful.

The accuracies for random data are shown in Table 3. Here, too, the upper and lower
parts of the table show results with and without the row reordering heuristic, respectively.
In general there is a correlation between cost and accuracy; the more entries of the matrix
you inspect, the more accurate the results are. In terms of accuracy the UB algorithm gives
the best results, with nearly 100 percent accuracy in almost every case. The PR algorithm
has an average accuracy of 0.85 with schedule D, which is a very good result considering
that the algorithm inspected on average only 23 percent of the entries of $X$. With the
learned schedule accuracy drops to 0.81, however.

Table 2: Costs for different schedules using random data ($k = 10$) with (top) and without
(bottom) the row reordering heuristic.

      A            B            C            D            learned
UB    0.83 (0.12)  0.85 (0.09)  0.88 (0.09)  0.88 (0.07)  -
MP    0.69 (0.14)  0.69 (0.15)  0.60 (0.14)  0.66 (0.13)  -
PR    0.44 (0.17)  0.25 (0.10)  0.31 (0.15)  0.23 (0.10)  0.22 (0.10)

UB    0.86 (0.10)  0.89 (0.08)  0.91 (0.07)  0.91 (0.06)  -
PR    0.56 (0.14)  0.37 (0.10)  0.43 (0.12)  0.35 (0.09)  0.34 (0.09)

Table 3: Accuracies for different schedules using random data ($k = 10$) with (top) and
without (bottom) the row reordering heuristic.

      A            B            C            D            learned
UB    0.87 (0.21)  0.99 (0.01)  0.99 (0.01)  1.00 (0.00)  -
MP    0.55 (0.21)  0.88 (0.10)  0.62 (0.20)  0.88 (0.11)  -
PR    0.84 (0.14)  0.84 (0.14)  0.84 (0.14)  0.85 (0.16)  0.81 (0.14)

UB    0.89 (0.19)  0.99 (0.01)  0.99 (0.01)  1.00 (0.00)  -
PR    0.90 (0.09)  0.91 (0.08)  0.89 (0.10)  0.90 (0.10)  0.88 (0.10)

Table 4: Costs for different schedules using Wikipedia data ($k = 10$) with (top) and
without (bottom) the row reordering heuristic.

      A            B            C            D            learned
UB    1.00 (0.00)  1.00 (0.00)  1.00 (0.00)  0.91 (0.05)  -
MP    0.95 (0.00)  0.81 (0.06)  0.67 (0.00)  0.82 (0.00)  -
PR    0.43 (0.27)  0.43 (0.09)  0.64 (0.28)  0.43 (0.22)  0.66 (0.32)

UB    0.76 (0.20)  0.99 (0.01)  0.62 (0.23)  0.99 (0.01)  -
PR    0.50 (0.24)  0.50 (0.09)  0.71 (0.24)  0.50 (0.20)  0.74 (0.25)

Table 5: Accuracies for different schedules using Wikipedia data ($k = 10$) with (top) and
without (bottom) the row reordering heuristic.

      A            B            C            D            learned
UB    1.00 (0.00)  1.00 (0.00)  1.00 (0.00)  1.00 (0.00)  -
MP    0.76 (0.20)  0.99 (0.01)  0.62 (0.23)  0.99 (0.01)  -
PR    0.63 (0.30)  0.81 (0.22)  0.81 (0.25)  0.83 (0.23)  0.83 (0.23)

UB    1.00 (0.00)  1.00 (0.00)  1.00 (0.00)  1.00 (0.00)  -
PR    0.64 (0.29)  0.84 (0.20)  0.82 (0.25)  0.83 (0.24)  0.85 (0.21)

Table 6: Cost (top) and accuracy (middle) in random data with the PR algorithm for
different $\alpha$.

               A            B            C            D            learned
$\alpha^*/2$   0.46 (0.19)  0.26 (0.10)  0.37 (0.13)  0.27 (0.11)  0.27 (0.11)
$\alpha^*$     0.40 (0.18)  0.24 (0.09)  0.32 (0.12)  0.24 (0.10)  0.23 (0.10)
$2\alpha^*$    0.34 (0.16)  0.22 (0.08)  0.25 (0.10)  0.20 (0.08)  0.19 (0.08)

$\alpha^*/2$   0.87 (0.10)  0.88 (0.11)  0.90 (0.09)  0.91 (0.09)  0.88 (0.11)
$\alpha^*$     0.82 (0.14)  0.84 (0.13)  0.85 (0.11)  0.86 (0.11)  0.84 (0.13)
$2\alpha^*$    0.74 (0.16)  0.79 (0.15)  0.75 (0.14)  0.80 (0.14)  0.75 (0.17)

$g_{1/2}$      0.92         0.97         0.91         0.94         0.89
$g_2$          1.06         1.02         1.13         1.11         1.08

Table 7: Cost (top) and accuracy (middle) in Wikipedia with the PR algorithm for
different $\alpha$.

               A            B            C            D            learned
$\alpha^*/2$   0.46 (0.30)  0.47 (0.10)  0.73 (0.24)  0.55 (0.23)  0.73 (0.28)
$\alpha^*$     0.30 (0.24)  0.42 (0.08)  0.60 (0.27)  0.48 (0.22)  0.63 (0.30)
$2\alpha^*$    0.19 (0.01)  0.37 (0.05)  0.28 (0.29)  0.35 (0.22)  0.42 (0.29)

$\alpha^*/2$   0.81 (0.21)  0.89 (0.17)  0.88 (0.21)  0.92 (0.18)  0.92 (0.16)
$\alpha^*$     0.62 (0.27)  0.82 (0.21)  0.79 (0.24)  0.88 (0.22)  0.84 (0.21)
$2\alpha^*$    0.37 (0.25)  0.69 (0.23)  0.46 (0.33)  0.76 (0.27)  0.64 (0.28)

$g_{1/2}$      0.85         0.97         0.92         0.91         0.95
$g_2$          0.94         0.96         1.25         1.18         1.14

Cost and accuracy for the Wikipedia data are shown in Tables 4 and 5, respectively.
The numbers are averages over 100 different queries that belong to the test set. In terms
of the cost the PR algorithm is again a clear winner. The best schedules for PR are A,
B, and D, with the learned schedule having problems. When accuracy is considered, we
observe that schedule A performs considerably worse than the others. Overall the best
choice is D (order attributes in decreasing order of the ratio $|w_j| / C(A_j)$), however.
With this schedule the PR algorithm attains an accuracy of 0.83 and pays only 43 percent of
the maximum cost. As with random data, the costs increase for PR when the row reordering
heuristic is not used. Interestingly, UB performs better with schedules A and C without
row reordering. In fact, with schedule C the UB algorithm attains a rather nice result by
having an accuracy of 1.00 with an average cost of 0.62.

6.3 Sensitivity to the parameter $\alpha$

We continue by studying the sensitivity of the PR algorithm to the value of $\alpha$. A
method for selecting a good value of $\alpha$ was proposed in Section 5.1. We compare this
value, denoted $\alpha^*$, with the values $2\alpha^*$ and $\frac{1}{2}\alpha^*$. In addition
to the actual values of cost and accuracy, we also show two other quantities, denoted
$g_{1/2}$ and $g_2$. These indicate the ratio of the relative change in accuracy to the
relative change in the cost when $\alpha$ is halved or doubled, respectively. We let

$$g_{1/2} = \frac{\mathrm{acc}_k(X, \alpha^*/2) / \mathrm{acc}_k(X, \alpha^*)}{\mathrm{cost}_k(X, \alpha^*/2) / \mathrm{cost}_k(X, \alpha^*)}, \quad \text{and} \quad g_2 = \frac{\mathrm{acc}_k(X, 2\alpha^*) / \mathrm{acc}_k(X, \alpha^*)}{\mathrm{cost}_k(X, 2\alpha^*) / \mathrm{cost}_k(X, \alpha^*)}.$$

When $g_{1/2} < 1$ the relative increase in accuracy is less than the relative increase in
costs. Respectively, when $g_2 > 1$ the relative decrease in accuracy is larger than the
relative decrease in costs. On the other hand, when either $g_{1/2} > 1$ or $g_2 < 1$ it
would be more efficient to use $\alpha^*/2$ or $2\alpha^*$ instead of $\alpha^*$.
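
As a worked example, the two ratios can be recomputed from the rounded averages reported
for schedule A on random data in Table 6. The sketch below reads each quantity as the
ratio of new to reference accuracy divided by the ratio of new to reference cost; this
reading is an assumption, adopted because it reproduces the tabulated values, and the
function name is hypothetical.

```python
def change_ratio(acc_new, acc_ref, cost_new, cost_ref):
    """Relative change in accuracy divided by the relative change in cost."""
    return (acc_new / acc_ref) / (cost_new / cost_ref)

# Schedule A, random data (Table 6):
# costs 0.46 / 0.40 / 0.34 and accuracies 0.87 / 0.82 / 0.74
# for alpha*/2, alpha* and 2 alpha*, respectively.
g_half = change_ratio(0.87, 0.82, 0.46, 0.40)   # alpha halved
g_two = change_ratio(0.74, 0.82, 0.34, 0.40)    # alpha doubled
```

Rounded to two decimals these evaluate to 0.92 and 1.06, matching the last two rows of
Table 6 for schedule A.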

Results for random data are shown in Table 6. Costs are shown in the top part of
the table, while accuracy is shown in the middle part. As expected, halving (doubling)
the value of $\alpha^*$ increases (decreases) both cost and accuracy. However, as indicated
by $g_{1/2}$ and $g_2$, the increase (decrease) in accuracy is never large (small) enough
to warrant the corresponding increase (decrease) in the cost. Table 7 shows the results for
Wikipedia. The behavior is the same as with random data, with the exception that now $g_2$
is below 1 for schedules A and B, indicating that in this case the relative gain in
decreased cost is larger than the relative loss in decreased accuracy. Indeed, using
schedule B (rank attributes in decreasing order of the absolute value of $w_j$) with
$\alpha$ set to $2\alpha^*$ we obtain an average cost of 0.37 with an average accuracy of
0.69, which can still be considered a reasonable performance.

Table 8: Costs (top) and accuracies (bottom) for the algorithms when $C(A_j) = w_j$ for
$k = 10$.

      A            B            C            D            learned
UB    0.86 (0.04)  0.81 (0.03)  0.91 (0.04)  0.85 (0.04)  -
MP    0.72 (0.07)  0.78 (0.03)  0.64 (0.05)  0.72 (0.07)  -
PR    0.47 (0.10)  0.37 (0.05)  0.62 (0.12)  0.46 (0.11)  0.44 (0.10)

UB    1.00 (0.00)  1.00 (0.00)  0.99 (0.03)  0.99 (0.01)  -
MP    0.69 (0.23)  0.92 (0.10)  0.36 (0.15)  0.67 (0.23)  -
PR    0.82 (0.12)  0.83 (0.13)  0.80 (0.15)  0.82 (0.14)  0.78 (0.17)

6.4 Correlated weights and costs 

This experiment is only run using random data. We want to study how the relationship
of $w$ and $C(A_j)$ affects the performance of the algorithms. We are interested in the case
where the most important attributes according to $w$, i.e. those with the highest absolute
values, also have the highest costs. In this case the baseline schedules B and C (see
Section 5.2) disagree as much as possible. The experiment is run in the same way as
the one in Section 6.2 with the exception that we let $C(A_j) = w_j$. The row reordering
heuristic is being used.
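
The effect of setting the costs equal to the weights on the baseline schedules can be
checked with a few lines of Python. The weights below are made up and assumed to be
distinct and positive; the variable names are illustrative only.

```python
weights = [0.9, 0.1, 0.5, 0.3]
costs = list(weights)  # the setting of this experiment: C(A_j) = w_j

# Schedule B: decreasing absolute weight. Schedule C: increasing cost.
order_b = sorted(range(len(weights)), key=lambda j: -abs(weights[j]))
order_c = sorted(range(len(costs)), key=lambda j: costs[j])

# Sorting keys of schedule D: the weight-cost ratio of every attribute.
ratios = [abs(w) / c for w, c in zip(weights, costs)]
```

Schedule C comes out as the exact reverse of schedule B, and every ratio used by
schedule D equals 1, so in this setting the order produced by D is decided by
tie-breaking alone.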

Results are shown in Table 8. Costs are given in the upper part of the table, while
accuracies are shown in the lower part. Clearly the PR algorithm still outperforms both
baselines with every schedule. However, when the numbers are compared with those in
Tables 2 and 3, we observe a noticeable decrease in performance. The average costs of C,
D, and the learned schedule are twice as high when the most important features also have
the highest costs. But even now the average cost of a query is less than 50 percent of the
full cost with the PR algorithm.

6.5 Effect of k 

The performance of the algorithms may also depend on the size of the top-$k$ set. For
smaller $k$ we expect the pruning to be more efficient, as the threshold $\delta$ is larger.
In addition to $k = 10$ that was used in the previous experiments, we also run the
algorithms with $k = 5$ and $k = 20$ to see how this affects the results. In this test we
only consider the weight-cost ratio heuristic (D) for the schedule.

Table 9 shows results for random data. Clearly the cost increases as $k$ increases.
Especially for the PR algorithm the effect is considerable. However, accuracy is not really
affected for any of the algorithms. Results for Wikipedia are shown in Table 10. Here we
do not see any significant effect on either the cost or accuracy.

Table 9: Costs (top) and accuracies (bottom) with random input, schedule D and different
$k$.

      k = 5        k = 10       k = 20
UB    0.85 (0.08)  0.88 (0.07)  0.93 (0.05)
MP    0.64 (0.17)  0.66 (0.13)  0.68 (0.12)
PR    0.19 (0.09)  0.23 (0.10)  0.29 (0.10)

UB    1.00 (0.00)  1.00 (0.00)  1.00 (0.00)
MP    0.87 (0.20)  0.88 (0.11)  0.89 (0.12)
PR    0.87 (0.14)  0.85 (0.16)  0.86 (0.09)

Table 10: Costs (top) and accuracies (bottom) with Wikipedia, schedule D and different
$k$.

      k = 5        k = 10       k = 20
UB    0.83 (0.00)  0.91 (0.05)  0.84 (0.00)
MP    0.82 (0.00)  0.82 (0.00)  0.83 (0.00)
PR    0.41 (0.22)  0.43 (0.22)  0.43 (0.22)

UB    0.98 (0.05)  1.00 (0.00)  0.99 (0.01)
MP    0.99 (0.03)  0.99 (0.01)  1.00 (0.00)
PR    0.83 (0.28)  0.83 (0.23)  0.79 (0.25)

7 Conclusion and future work 

We have discussed an algorithm for approximate top-$k$ search in a setting where the input
relation is initially hidden, and its elements can be accessed only by paying a (usually
computational) cost. The score of a row is defined as its inner product with a scoring
vector. The basic task is to find an approximate top-$k$ set while keeping the total cost of
the query low. Although we consider linear scoring functions in this paper, the proposed
approach should lend itself also to other types of aggregation functions.

Since the contents of the relation are unknown before any queries are issued, indexing
its contents is not possible. This is a key property of our setting that differentiates it
from most of the existing literature on top-$k$ as well as $k$-NN search. Instead we have
access to training data from the same distribution as the hidden relation. The algorithm we
propose is based on the use of this training data. Given the partial score of an item, the
algorithm estimates the probability that the full score of the item will be high enough for
the item to enter the current top-$k$ set. The estimator for this probability is learned
from training data. The algorithm has two parameters. We also propose methods for learning
good values for these from training data. The experiments indicate that our proposed
algorithm outperforms the baseline in terms of the cost by a considerable margin. While
the MPro [13] algorithm attains a very high accuracy, it does this at a high cost.

The work presented in this paper is mostly related to databases and approximate
nearest neighbor search. However, we also want to point out some connections to
classification problems, and especially feature selection. Our approach can be seen as a
form of dynamic feature selection for top-$k$ problems with the aim to reduce the overall
cost of the query. Similarly we can consider cost-sensitive classification (see e.g. [5]),
where the task is to classify a given set of items while keeping the total cost as low as
possible. Based on a subset of the available features the classifier makes an initial
prediction, and if this prediction is not certain enough, we read the value of a yet
unknown feature and update the prediction accordingly. Decision trees already implement
this principle in a way, but it might be interesting to extend it to other classification
algorithms, such as SVMs.

Conversely, a potentially interesting approach to extending the work of this paper
is to replace the linear schedule with something more complex, such as a decision tree.
In this case the next attribute to be read would depend on the value(s) of the previous
attribute(s). The results of [21 EH El EE] might provide a fruitful starting point for
studying the theoretical properties of the problem. Further studies include the use of more
complex models than linear regression for estimating $\mu$ and $\sigma$. Also, using
distributions other than a Gaussian for the full score given a prefix score may be of
interest.